Clustering for high-dimension, low-sample size data using distance vectors

نویسنده

  • Yoshikazu Terada
چکیده

In high-dimension, low-sample size (HDLSS) data, it is not always true that closeness of two objects reflects a hidden cluster structure. We point out the important fact that it is not the closeness, but the “values” of distance that contain information of the cluster structure in highdimensional space. Based on this fact, we propose an efficient and simple clustering approach, called distance vector clustering, for HDLSS data. Under the assumptions given in the work of Hall et al. (2005), we show the proposed approach provides a true cluster label under milder conditions when the dimension tends to infinity with the sample size fixed. The effectiveness of the distance vector clustering approach is illustrated through a numerical experiment and real data analysis.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Clustering High Dimension, Low Sample Size Data Using the Maximal Data Piling Distance

We propose a new hierarchical clustering method for high dimension, low sample size (HDLSS) data. The method utilizes the fact that each individual data vector accounts for exactly one dimension in the subspace generated by HDLSS data. The linkage that is used for measuring the distance between clusters is the orthogonal distance between affine subspaces generated by each cluster. The ideal imp...

متن کامل

The distance correlation t-test of independence in high dimension

AMS subject classifications: primary 62G10 secondary 62H20 Keywords: dCor dCov Multivariate independence Distance covariance Distance correlation High dimension a b s t r a c t Distance correlation is extended to the problem of testing the independence of random vectors in high dimension. Distance correlation characterizes independence and determines a test of multivariate independence for rand...

متن کامل

Feature Extraction and Efficiency Comparison Using Dimension Reduction Methods in Sentiment Analysis Context

Nowadays, users can share their ideas and opinions with widespread access to the Internet and especially social networks. On the other hand, the analysis of people's feelings and ideas can play a significant role in the decision making of organizations and producers. Hence, sentiment analysis or opinion mining is an important field in natural language processing. One of the most common ways to ...

متن کامل

Subspace Clustering with Missing Data

1 Subspace clustering with missing data can be seen as the combination of subspace clustering and low rank matrix completion, which is essentially equivalent to high-rank matrix completion under the assumption that columns of the matrix X ∈ Rd×N belong to a union of subspaces. It’s a challenging problem, both in terms of computation and inference. In this report, we study two efficient algorith...

متن کامل

Asymptotic Properties of Distance-Weighted Discrimination

While Distance-Weighted Discrimination (DWD) is an appealing approach to classification in high dimensions, it was designed for balanced data sets. In the case of unequal costs, biased sampling or unbalanced data, there are major improvements available, using appropriately weighted versions of DWD. A major contribution of this paper is the development of optimal weighting schemes for various no...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1312.3386  شماره 

صفحات  -

تاریخ انتشار 2013